Parameter Choices

#use the corrected cohorts
new_cohort_assignment <- TRUE

#use the new fixed genotypes, the old genotypes, or the new genotypes with filtering that matches the old
# options: "new", "old", or "new_w_old_filt"
use_fixed_genotypes <- "new" 

#adjust these if you want the relevant chunks to run
rerun_COLONY <- TRUE

#COLONY params for simulations that will actually be used
COLONY_run_length <- 2
COLONY_allele_freq <- NULL # vs "overall" for population averages. 
COLONY_dropout <- 0 
COLONY_error_rate <- 0.02
COLONY_random_seed <- 1

#see scenarios section for parameter choices
rerun_COLONY_all <- FALSE #very slow and only really needed for the sensitivity testing, so wouldn't normally be rerun

rerun_LDNE <- TRUE

#p_crit choices
p_crit_list <- c(0, 0.02, 0.05) 
p_crit_chosen <- 0.05

include_2014 <- TRUE #the 2014 estimates are still run, just not included in the analysis

Na_for_ratios <- 750 #from Bruce et al. (2018)

Overview

This document/repository contains and outlines the re-analysis of the data presented in Davenport et al. (2021), as re-analysed by Andrew Jones.

The authors were made aware of an error in the files used as input to both NeEstimator and COLONY to estimate the number of breeders per cohort. This error caused the data to become shuffled. Below, an analysis is presented comparing the published data to files made directly from the original raw data. The input files for Nb estimation with NeEstimator and COLONY are re-made from the raw data (based on the published loci) and the updated results are presented. The published work further sought to combine estimators of effective size to make reporting and decision making easier, producing combined estimates of Nb per cohort. Here, using the revised results, those combined estimates are also re-made.

An additional sensitivity analysis is performed on the full dataset to illustrate how the inclusion of different SNPs and individuals can affect final point estimates of effective size. This highlights the importance, for conservation decisions, of considering confidence intervals (i.e. using a 95% CI we expect that, 95% of the time, the true value of the parameter lies within the interval) as well as multiple estimators.

Background

In Davenport et al. (2021), the effective number of breeders (Nb) was estimated in 4 consecutive cohorts (2010-2013) based on the premise that, when a genetic sample contains only individuals from a single age cohort (a group of individuals of the same age-class), the estimate of effective population size (Ne) from a single-sample estimator in species with overlapping generations corresponds to the effective number of breeders (Nb). For long-lived, iteroparous species such as the white shark, estimates of Nb are generally considered more useful for monitoring because they apply to a single breeding season (rather than requiring a sample representative of a whole generation - an assumption of Ne estimation (see Waples (2022))), and because Nb represents an accessible parameter for monitoring population trends at the ecological timescales most relevant to conservation and management needs. Nb is particularly useful in cases where juvenile or YOY samples can be easily collected (e.g. through the NSW SMART shark program). Estimating Nb and monitoring its change over time allows timely detection of population trends (decline, restoration, recovery, expansion), even when using as few as four or five consecutive reproductive cycles (Leberg (2005); J. Wang (2005); Luikart et al. (2021)).

Data

The original DArT data for this project can be found in /Data/Raw.

The results of the reanalysis can be found in /Results.

The filtered list of loci from the published analysis can be found in Data/Raw/Report-DSha18-3402/File_Formats/Genepop_NeEstimator_Input/nswdpi_whiteshark_gp.gen.

The loci names were extracted from this document to ensure the data from the publication are used in the reanalysis here. Readers are referred to the original publication, Davenport et al. (2021), for more detailed methods. In summary, a list of loci/SNPs that meet the article's filtering criteria is saved in the ‘.gen’ file. These are used here to re-make the input files for NeEstimator and COLONY from the original raw data, from which Nb per cohort is re-estimated. The full raw data are also used to re-estimate Nb in a sensitivity analysis of loci/individuals.

The processed data are found in /Data/Processed.

Confirmation of Issue in Original Manuscript

For this section I follow the approach used by Dani as far as was required to confirm the work of Dean and Paul.

Setup data

A bug in an R-package used in the original publication is suspected to have introduced inconsistencies into the output files used in the final analysis.

# read in published strata
strata_info = read.table("Original Project Files/Strata_Sex_Model_COHORT_REGRESSION.tsv", header = T) # used in the published ms 

# read in the published genepop file affected by the bug - herein referred to as 'orig-genepop'
orig_gp <- adegenet::read.genepop(file = "Original Project Files/File_Formats/Genepop_NeEstimator_Input/nswdpi_whiteshark_gp.gen", ncode = 3L, quiet = TRUE)
orig_gl <- dartR::gi2gl(orig_gp)
## Starting :: 
##  Starting dartR 
##  Starting gi2gl 
## Starting gl.compliance.check 
##   Processing genlight object with SNP data
##   The slot loc.all, which stores allele name for each locus, is empty. 
## Creating a dummy variable (A/C) to insert in this slot. 
##   Checking coding of SNPs
##     SNP data scored NA, 0, 1 or 2 confirmed
##   Checking locus metrics and flags
##   Recalculating locus metrics
##   Checking for monomorphic loci
##     Dataset contains monomorphic loci
##   Checking for loci with all missing data
##     No loci with all missing data detected
##   Checking whether individual names are unique.
##   Checking for individual metrics
##   Warning: Creating a slot for individual metrics
##   Checking for population assignments
##     Population assignments confirmed
##   Spelling of coordinates checked and changed if necessary to 
##             lat/lon
## Completed: gl.compliance.check 
## Completed: :: 
##  Completed: dartR 
##  Completed: gi2gl
# read.genepop didn't fix the names, so do it here
orig_gl@loc.names <- sub(".*?-", "", str_replace_all(orig_gl@loc.names, "__", "-"))  # change whitelist loci separators so they are in the format needed to match the genlight obj

# read in the original raw data, and use the list of loci to make a new dataset - herein referred to as 'new-genepop'
input_gl = dartR::gl.read.dart("Data/Raw/Report_DSha18-3402_SNP_2_ReLabeled.csv")
## Starting :: 
##  Starting dartR 
##  Starting gl.read.dart 
## Starting utils.read.dart 
##   Topskip not provided.
##  Setting topskip to 6 .
##   Reading in the SNP data
##   Detected 2 row format.
## Number of rows per clone (should be only  2 s): 2 
##  Added the following locus metrics:
## AlleleID AlleleSequence TrimmedSequence Chrom_Shark_Callorhinchus_v1 ChromPos_Shark_Callorhinchus_v1 AlnCnt_Shark_Callorhinchus_v1 AlnEvalue_Shark_Callorhinchus_v1 SNP SnpPosition CallRate OneRatioRef OneRatioSnp FreqHomRef FreqHomSnp FreqHets PICRef PICSnp AvgPIC AvgCountRef AvgCountSnp RepAvg .
## Recognised: 278 individuals and 9841  SNPs in a 2 row format using Data/Raw/Report_DSha18-3402_SNP_2_ReLabeled.csv 
## Completed: utils.read.dart 
## Starting utils.dart2genlight 
## Starting conversion....
## Format is 2 rows.
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using save(object, file="object.rdata")
## Completed: utils.dart2genlight 
##  Data read in. Please check carefully the output above
##   Read depth calculated and added to the locus metrics
##   Minor Allele Frequency (MAF) calculated and added to the locus metrics
##   Recalculating locus metrics provided by DArT (optionally specified)
## Completed: :: 
##  Completed: dartR 
##  Completed: gl.read.dart
# change the GL loci names so they are in a suitable format 
input_gl@loc.names <- sub("^([^-]*-[^-]*).*", "\\1", input_gl@loc.names)
# filter the gl to have the same loci as used in 'orig-gp'
order_loci <- sapply(orig_gl@loc.names, function(x,order){which(order == x)}, order = input_gl@loc.names)
new_gl <- input_gl[,order_loci]
# dartR has a bug where subsetting does not also subset the locus metrics stored in the @other slot
dim(new_gl@other$loc.metrics)
## [1] 3668   24
new_gl@other$loc.metrics <- new_gl@other$loc.metrics[order_loci,]
# match so the files have the same samples, in the same order
order_ind <- sapply(orig_gl@ind.names, function(x,order){which(order == x)}, order = str_replace_all(new_gl@ind.names, "_", "-"))

new_gl <- new_gl[order_ind,]
# check
dim(new_gl@other$loc.metrics)
## [1] 3668   24
# add the population information, also making sure it's in the right order
pop = dplyr::left_join(tibble::tibble(TARGET_ID=str_replace_all(new_gl@ind.names, "_", "-")), strata_info, by = "TARGET_ID")
new_gl@pop<- factor(pop$STRATA)
# check 
which(head(str_replace_all(new_gl@ind.names, "_", "-")) ==  head(orig_gl@ind.names))
## [1] 1 2 3 4 5 6
which(head(orig_gl@loc.names) == head(new_gl@loc.names))
## [1] 1 2 3 4 5 6
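The head()-based checks above only compare the first six entries, and which() silently returns a shorter vector on a mismatch. A stricter full-vector check can be sketched as below; the sample IDs here are illustrative, not real dataset names:

```r
# Stricter sanity check than head(): compare the entire name vectors.
# The IDs below are made up for illustration.
orig_names <- c("MBB-1001", "MBB-1002", "MBB-1003")
new_names  <- c("MBB_1001", "MBB_1002", "MBB_1003")

# identical() yields a single TRUE/FALSE, so a partial or silent mismatch cannot slip through
same_order <- identical(gsub("_", "-", new_names, fixed = TRUE), orig_names)
stopifnot(same_order)
```

In the analysis itself, `orig_gl@ind.names` and `str_replace_all(new_gl@ind.names, "_", "-")` would take the place of the toy vectors.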

Quantify differences in missingness between the published and new dataset

# compare the old and the new files for missingness
# print out each genlight to get info on missingness, loci, individuals
orig_gl
##  /// GENLIGHT OBJECT /////////
## 
##  // 241 genotypes,  3,668 binary SNPs, size: 1.2 Mb
##  4506 (0.51 %) missing data
## 
##  // Basic content
##    @gen: list of 241 SNPbin
##    @ploidy: ploidy of each individual  (range: 2-2)
## 
##  // Optional content
##    @ind.names:  241 individual labels
##    @loc.names:  3668 locus labels
##    @loc.all:  3668 alleles
##    @pop: population of each individual (group size range: 1-63)
##    @other: a list containing: loc.metrics  loc.metrics.flags  verbose  history  ind.metrics
# print out the new genlight for comparison
new_gl
##  ********************
##  *** DARTR OBJECT ***
##  ********************
## 
##  ** 241 genotypes,  3,668 SNPs , size: 5.7 Mb
## 
##     missing data: 4503 (=0.51 %) scored as NA
## 
##  ** Genetic data
##    @gen: list of 241 SNPbin
##    @ploidy: ploidy of each individual  (range: 2-2)
## 
##  ** Additional data
##    @ind.names:  241 individual labels
##    @loc.names:  3668 locus labels
##    @loc.all:  3668 allele labels
##    @position: integer storing positions of the SNPs [within 69 base sequence]
##    @pop: population of each individual (group size range: 1-63)
##    @other: a list containing: loc.metrics, ind.metrics, loc.metrics.flags, verbose, history 
##     @other$ind.metrics: service, plate_location 
##     @other$loc.metrics: AlleleID, AlleleSequence, TrimmedSequence, Chrom_Shark_Callorhinchus_v1, ChromPos_Shark_Callorhinchus_v1, AlnCnt_Shark_Callorhinchus_v1, AlnEvalue_Shark_Callorhinchus_v1, SNP, SnpPosition, CallRate, OneRatioRef, OneRatioSnp, FreqHomRef, FreqHomSnp, FreqHets, PICRef, PICSnp, AvgPIC, AvgCountRef, AvgCountSnp, RepAvg, clone, uid, rdepth 
##    @other$latlon[g]: no coordinates attached

These results show that the number of individuals and the number of loci do not change between the published and the new dataset. However, the count of missing genotypes in the published dataset is 4506, while in the re-analysis dataset it is 4503.
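The missingness figures reported above are simply counts of NA genotype calls in the individuals-by-loci matrix (a genlight object converts to such a matrix with as.matrix()). A minimal base-R sketch with a toy matrix:

```r
# Toy genotype matrix: 2 individuals x 3 loci, scored 0/1/2 with NA for missing
geno <- matrix(c(0, 1, NA, 2, NA, 1), nrow = 2,
               dimnames = list(c("ind1", "ind2"), c("loc1", "loc2", "loc3")))

n_missing   <- sum(is.na(geno))                 # count of missing calls
pct_missing <- 100 * n_missing / length(geno)   # as a percentage of all calls
```

Applying the same two lines to `as.matrix(orig_gl)` and `as.matrix(new_gl)` reproduces the 4506 vs 4503 comparison.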

Visualise the difference between the published and new dataset

# visualize the data
ogplot = adegenet::glPlot(orig_gl, posi="topleft")

newplot = adegenet::glPlot(new_gl, posi="topleft")

Discussion of error

More details available in Dani’s report and Dean and Paul’s report.

Main points:

  • Error definitely present as described in the analysis by Dean and Paul.
  • Error changes the data in a way that would be expected to have some impact on estimates of Nb
  • Error needs to be corrected

Corrected Ne Estimates

New cohort assignment

The new cohort assignment file “Data/Processed/New_Cohort_Assignment.tsv” is created by running the following chunks.

The von Bertalanffy parameters used are from O’Connor (2011). Further work could be done exploring the impact of parameter choices; however, initial work suggests the estimates are fairly robust to small numbers of individuals being allocated to different cohorts.

meta_data <- readr::read_csv("Data/Raw/Copy of UPDTAED LAT LONGS UQ genetics PhD181018 Paul Butcher NSW DPI.csv") %>%
  #select(MBB_Code, FL, TL) %>%
  mutate(date_column = as.Date(`Date tagged`, format="%d/%m/%y"),
         modified_year = year(date_column))

# 1. relationship between TL and FL 
lm1 <- lm(meta_data$TL ~ meta_data$FL)

Tl_intercept <- lm1$coefficients[1]
Tl_slope <- lm1$coefficients[2]
#
Tl_intercept <- floor(Tl_intercept * 100) / 100
Tl_slope <- floor(Tl_slope * 100) / 100 # eq (1) manuscript , round down to two decimal places

TL <- sapply(meta_data$FL, function(x) {Tl_intercept + x * Tl_slope}) # eq (1) manuscript 
# Estimate age directly using values from O'Connor 2011 and FL
#solve for age
vbgf3 <- function(L, L_infinity, K_value, t_0) {
  result <- t_0 - (1 / K_value) * log(1 - (L / L_infinity))
  return(result)
}
# Estimating age at length
# info:
# Best fitting parameters for Males
Linf <- 798.94# cm TL
k <- 0.047 # 
L0 <- 140 #cm 
T0 <- -3.8 #years 
Males <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)

# Best fitting parameters for Females 
Linf<- 719.02# cm TL
k<- 0.056 # 
L0<- 140 #cm 
T0 <- -3.8 #years 
Females <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)

# Best fitting parameters for sexes combined
Linf<- 746.66# cm TL
k<- 0.053# 
L0<- 140 #cm 
T0 <- -3.8 #years 
Combined <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)

# indices for females and males
idx_f<- which(meta_data$Sex == "F")
idx_m<- which(meta_data$Sex == "M")

male_t = sapply(TL[idx_m], function(x) {vbgf3(x, Males[["L_infinity_value"]], Males[["K_value"]], Males[["t_0_value"]])})
female_t = sapply(TL[idx_f], function(x) {vbgf3(x, Females[["L_infinity_value"]], Females[["K_value"]], Females[["t_0_value"]])})

# Cohort is equal to year sampled minus age
male_age = trunc(meta_data$modified_year[idx_m] -  male_t)
female_age = trunc(meta_data$modified_year[idx_f] -  female_t) # for some samples (e.g. MBB1348) the assigned cohort depends on how the age or the year is rounded
output = tibble::tibble(STRATA = c(male_age, female_age), MBB_CODE = c(meta_data$MBB_Code[idx_m], meta_data$MBB_Code[idx_f]))
readr::write_tsv(output, "Data/Processed/New_Cohort_Assignment.tsv")
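As a sanity check on the inverted growth function, vbgf3() can be evaluated directly with the combined-sex parameters from O'Connor (2011) quoted above; the example length of 300 cm TL is illustrative, not a value from the dataset:

```r
# Invert the VBGF: solve for age t given total length L (same form as vbgf3 above)
vbgf3 <- function(L, L_infinity, K_value, t_0) {
  t_0 - (1 / K_value) * log(1 - (L / L_infinity))
}

# Combined-sex parameters quoted above (O'Connor 2011); 300 cm TL is an illustrative length
age <- vbgf3(300, L_infinity = 746.66, K_value = 0.053, t_0 = -3.8)
round(age, 2)  # roughly 5.89 years
```

A shark of that length sampled in, say, 2018 would therefore be truncated to age 5 and assigned to the 2013 cohort, mirroring the `trunc(modified_year - t)` step above.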

New Input files for NeEstimator and COLONY

Below the input files for NeEstimator and COLONY are remade to make new estimates of Nb using a new file made from the published list of loci (“new_gp”).

dartR is used to write out a genepop file, and a custom script based on the radiator package is used to write out a COLONY file. These COLONY files also contain the COLONY parameter settings.

# 1. write out new_gp as a genepop , check that the file has been written out correctly 
# 1a. run this file through NeEstimator

# first it's important to make sure that the cohorts are correctly assigned 

# some sharks have no cohort assignment - possibly due to missing metadata?

if(new_cohort_assignment){
  cohort_assignment_file <- "Data/Processed/New_Cohort_Assignment.tsv"
  
  cohort_assignment = readr::read_tsv(cohort_assignment_file) %>% #New cohort assignment
  separate(MBB_CODE, into = c("string1", "string2"), sep = " ", remove = FALSE) %>%
  unite(MBB_CODE_JOIN, string1, string2, sep = "_") 
  
}else{
  cohort_assignment_file <- "Original Project Files/Strata_Sex_Model_COHORT_REGRESSION.tsv"
  
  cohort_assignment = readr::read_tsv(cohort_assignment_file) %>% #New cohort assignment
  separate(INDIVIDUALS, into = c("string1", "string2"), sep = "-", remove = FALSE) %>%
  unite(MBB_CODE_JOIN, string1, string2, sep = "_") 
}


if(use_fixed_genotypes == "old"){
  gl_file <- orig_gl
}else if(use_fixed_genotypes == "new_w_old_filt"){
  gl_file <- new_gl
}else if(use_fixed_genotypes == "new"){
  source("./Scripts/filter_script.R")
  gl_file <- readRDS('./Data/Processed/filtered_genotypes.RData')
}else{
  stop("need to choose one of 'old', 'new', or 'new_w_old_filt' for use_fixed_genotypes")
}
## Starting gl.read.dart 
## Starting utils.read.dart 
##   Topskip not provided.
##  Setting topskip to 6 .
##   Reading in the SNP data
##   Detected 2 row format.
## Number of rows per clone (should be only  2 s): 2 
##  Added the following locus metrics:
## AlleleID AlleleSequence TrimmedSequence Chrom_Shark_Callorhinchus_v1 ChromPos_Shark_Callorhinchus_v1 AlnCnt_Shark_Callorhinchus_v1 AlnEvalue_Shark_Callorhinchus_v1 SNP SnpPosition CallRate OneRatioRef OneRatioSnp FreqHomRef FreqHomSnp FreqHets PICRef PICSnp AvgPIC AvgCountRef AvgCountSnp RepAvg .
## Recognised: 278 individuals and 9841  SNPs in a 2 row format using ./Data/Raw/Report_DSha18-3402_SNP_2_ReLabeled.csv 
## Completed: utils.read.dart 
## Starting utils.dart2genlight 
## Starting conversion....
## Format is 2 rows.
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using save(object, file="object.rdata")
## Adding individual metrics: ./Data/Raw/Davenport_Dart__Report_DSha18_Dups_Relabeled__dartR_METADATA_v1.csv .
##   Ids for individual metadata (at least a subset of) are matching!
##   Found  278 matching ids out of 278 ids provided in the ind.metadata file.
##   Added population assignments.
## Warning: Individual metrics do not include a latitude (lat) column
## Warning: Individual metrics do not include a longitude (lon) column
##  Added  id  to the other$ind.metrics slot.
##   Added  pop  to the other$ind.metrics slot.
##   Added  id_alt  to the other$ind.metrics slot.
##   Added  yob_dp  to the other$ind.metrics slot.
##   Added  yob_db  to the other$ind.metrics slot.
##   Added  comment_1  to the other$ind.metrics slot.
##   Added  id_rec_dsha  to the other$ind.metrics slot.
## Completed: utils.dart2genlight 
##  Data read in. Please check carefully the output above
##   Read depth calculated and added to the locus metrics
##   Minor Allele Frequency (MAF) calculated and added to the locus metrics
##   Recalculating locus metrics provided by DArT (optionally specified)
## Completed: gl.read.dart 
## Starting gl.drop.pop 
## [dartR vers. 2.9.7 Build = v.2023.1 ]
##   Processing genlight object with SNP data
##   Checking for presence of nominated populations, deleting them
##   Deleting populations SA, SAFRICA, WA 
##   Deleting monomorphic loc
##   Recalculating locus metrics
##   Summary of recoded dataset
##     No. of loci: 9655 
##     No. of individuals: 245 
##     Original no. of populations 4 
##     No. of populations remaining:  1 
## Completed: gl.drop.pop 
## Starting gl.filter.callrate 
##   Processing genlight object with SNP data
##   Recalculating Call Rate
##   Removing individuals based on Call Rate, threshold = 0.85 
##   Individuals deleted (CallRate <=  0.85 ):
## MBB_1348[NSW], MBB_1336[NSW], MBB_1431[NSW],
## Starting gl.recalc.metrics 
##   Processing genlight object with SNP data
## Starting utils.recalc.avgpic 
##   Processing genlight object with SNP data
##   Recalculating OneRatioRef, OneRatioSnp, PICRef, PICSnp, AvgPIC
## Completed: utils.recalc.avgpic 
## Starting utils.recalc.callrate 
##   Processing genlight object with SNP data
##   Recalculating locus metric CallRate
## Completed: utils.recalc.callrate 
## Starting utils.recalc.maf 
##   Processing genlight object with SNP data
##   Recalculating FreqHoms and FreqHets
## Starting utils.recalc.freqhets 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHets
## Completed: utils.recalc.freqhets 
## Starting utils.recalc.freqhomref 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomRef
## Completed: utils.recalc.freqhomref 
## Starting utils.recalc.freqhomsnp 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomSnp
## Completed: utils.recalc.freqhomsnp 
##   Recalculating Minor Allele Frequency (MAF)
## Completed: utils.recalc.maf 
##   Locus metrics recalculated
## Completed: gl.recalc.metrics 
##   Note: Locus metrics recalculated
##   Note: Resultant monomorphic loci not deleted
## Completed: gl.filter.callrate 
## Starting gl.filter.monomorphs 
##   Processing genlight object with SNP data
##   Identifying monomorphic loci
##   Removing monomorphic loci and loci with all missing 
##                        data
## Completed: gl.filter.monomorphs 
## Starting gl.filter.reproducibility 
##   Processing genlight object with SNP data
##   Removing loci with repeatability less than 0.98 
## Completed: gl.filter.reproducibility 
## Starting gl.filter.monomorphs 
##   Processing genlight object with SNP data
##   Identifying monomorphic loci
##   No monomorphic loci to remove
## Completed: gl.filter.monomorphs 
## Starting gl.drop.loc 
## [dartR vers. 2.9.7 Build = v.2023.1 ]
##   Processing genlight object with SNP data
##   List of loci to drop has been specified
##   Deleting the specified loci
##   Summary of recoded dataset
##     Original No. of loci: 6560 
##     No. of loci deleted: 579 
##     No. of loci retained: 5981 
## Completed: gl.drop.loc 
## Starting gl.filter.rdepth 
## [dartR vers. 2.9.7 Build = Jody ]
##   Processing genlight object with SNP data
##   Removing loci with rdepth <= 5 and >= 25 
##   Summary of filtered dataset
##     Initial no. of loci = 5981 
##     No. of loci deleted = 623 
##     No. of loci retained: 5358 
##     No. of individuals: 242 
##     No. of populations:  1 
## Completed: gl.filter.rdepth 
## Starting gl.filter.callrate 
##   Processing genlight object with SNP data
##   Recalculating Call Rate
##   Removing loci based on Call Rate, threshold = 0.75 
## Completed: gl.filter.callrate 
## Starting gl.filter.secondaries 
##   Processing genlight object with SNP data
##   Selecting one SNP per sequence tag at random
## Completed: gl.filter.secondaries 
## Starting gl.filter.callrate 
##   Processing genlight object with SNP data
##   Recalculating Call Rate
##   Removing individuals based on Call Rate, threshold = 0.85 
##   Note: Locus metrics recalculated
##   Note: Resultant monomorphic loci not deleted
## Completed: gl.filter.callrate 
## Starting gl.drop.ind 
## [dartR vers. 2.9.7 Build = v.2023.1 ]
##   Processing genlight object with SNP data
##   Deleting specified individuals
##    MBB_1341_Dup, MBB_1372, MBB_1417, MBB_1455, MBB_1483_Dup, MBB_1544 
##   Deleting monomorphic loc
##   Recalculating locus metrics
## Summary of recoded dataset
##   No. of loci: 4937 
##   Original No. of individuals: 242 
##   No. of individuals: 236 
##   Original No. of populations: 1 
##   No. of populations:  1 
## Completed: gl.drop.ind 
## Starting gl.filter.hwe 
##   Processing genlight object with SNP data
##   Pooling all populations for HWE calculations
##   Loci examined: 4937 
##   Deleted 673 loci with significant departure from HWE at alpha = 0.01 applied locus by locus
##   Loci retained: 4264 
## 
##     Adjustment of p-values for multiple comparisons vary with 
##                 sample size
## Completed: gl.filter.hwe 
## 
## Starting gl.basic.stats 
##   Processing genlight object with SNP data
## Completed: gl.basic.stats 
## Starting gl.drop.loc 
## [dartR vers. 2.9.7 Build = v.2023.1 ]
##   Processing genlight object with SNP data
##   List of loci to drop has been specified
##   Deleting the specified loci
##   Warning: no loci listed to delete! Genlight object returned unchanged
##   Summary of recoded dataset
##     Original No. of loci: 4264 
##     No. of loci deleted: 0 
##     No. of loci retained: 4264 
## Completed: gl.drop.loc
gl_file$ind.names <- str_replace(gl_file$ind.names, "-", "_")

ind_order = tibble::tibble(MBB_CODE_JOIN = gl_file$ind.names, OLD_STRATA = gl_file@pop) %>%
  left_join(., cohort_assignment, by = "MBB_CODE_JOIN")

gl_file@pop <- as.factor(ind_order$STRATA)
table(gl_file@pop)
## 
## 2005 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 
##    1    3    5   11   29   40   51   62   23    9    2
dartR::gl2genepop(gl_file, outfile = "genotypes_for_Ne.gen", outpath = "Data/Processed/NeEstimator")
## Starting :: 
##  Starting dartR 
##  Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  Data/Processed/NeEstimator/genotypes_for_Ne.gen/
## Completed: :: 
##  Completed: dartR 
##  Completed: gl2genepop

A number of different parameter settings were used.

For the error and dropout rates, the default settings, as used originally, are based on the recommendations of Jinliang Wang (2004). More recently Jinliang Wang (2018) has done further empirical work; those parameters were also tried, as well as some in-between values.

For each of the error and dropout rate options, a version was run with allele frequencies based only on the cohort (allele_freq = NULL, the default setting) and with estimates based on the entire population (allele_freq = “overall”).

The run length parameter was also experimented with.
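The combinations described above form a small parameter grid. A hypothetical sketch of how such a grid could be enumerated; the rate values and column names here are illustrative placeholders, not the actual scenario settings:

```r
# Illustrative scenario grid: every error-rate option crossed with both
# allele-frequency modes (values are placeholders, not the real settings)
scenarios <- expand.grid(
  error_rate  = c(0.02, 0.0001, 0.005, 0.01),  # default, Wang (2018), two in-between
  allele_freq = c(NA, "overall"),              # cohort-only (NULL/NA) vs whole-population
  stringsAsFactors = FALSE
)
scenarios$run_length <- 1                      # short runs for the sweep
nrow(scenarios)                                # 4 x 2 = 8 scenarios
```

Iterating over the rows of such a grid is one way to generate the numbered scenario files read back in below.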

Run COLONY

##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  6484759 346.4   12354135 659.8 12354135 659.8
## Vcells 23386055 178.5   79584701 607.2 79584385 607.2
#need to read COLONY data in from files

colony_results_files <- list.files("./Results/COLONY/")
colony_results_files <- colony_results_files[str_detect(colony_results_files, "^COLONY_output_scenario_[a-zA-Z0-9]+_[0-9]{4}\\.txt$")]

col_list <- list()

for (i in seq_len(length(colony_results_files))){
  c_r_f <- colony_results_files[i]
  res <- readLines(file.path("./Results/COLONY/", c_r_f))
  temp <- str_match(c_r_f, "^COLONY_output_(scenario_[a-zA-Z0-9]+)_([0-9]{4})\\.txt$")
  
  Year <- as.numeric(temp[3])
  Scenario <- temp[2]
  Nb_SA_est <- as.numeric(na.omit(str_match(res, "^Ne\\s+=\\s+([0-9]+)$")[, 2])[1]) 
  Nb_SA_upper <- as.numeric(na.omit(str_match(res, "^CI95\\(U\\)\\s+=\\s+([0-9]+)$")[, 2])[1]) 
  Nb_SA_lower <- as.numeric(na.omit(str_match(res, "^CI95\\(L\\)\\s+=\\s+([0-9]+)$")[, 2])[1]) 
  
  
  if(file.exists(file.path("./Results/COLONY",  str_replace(c_r_f, ".txt", "_FS.txt")))){
    #counts of full and half sibs
    res_FS <- readLines(file.path("./Results/COLONY",  str_replace(c_r_f, ".txt", "_FS.txt")))
    FS_count_Nb_SA <- length(res_FS)-1
  }else{
    FS_count_Nb_SA <- NA
  }
  
  if(file.exists(file.path("./Results/COLONY",  str_replace(c_r_f, ".txt", "_HS.txt")))){
    res_HS <- readLines(file.path("./Results/COLONY/",  str_replace(c_r_f, ".txt", "_HS.txt")))
    HS_count_Nb_SA <- length(res_HS)-1
  }else{
    HS_count_Nb_SA <- NA
  }
  
  col_list[[i]] <- data.frame(Year, Scenario, Nb_SA_est, Nb_SA_lower, Nb_SA_upper, FS_count_Nb_SA, HS_count_Nb_SA)
}

All_COLONY_data <- bind_rows(col_list) %>% filter(Year >= 2010, Year <= 2014)

write_csv(All_COLONY_data, "./Results/COLONY/ResultsSummary.csv")
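The filename parsing in the loop above hinges on one regular expression; the same extraction can be sketched in base R (the example filename is illustrative):

```r
# Pull the scenario label and year out of a COLONY results filename
f <- "COLONY_output_scenario_08_2013.txt"
m <- regmatches(f, regexec("^COLONY_output_(scenario_[a-zA-Z0-9]+)_([0-9]{4})\\.txt$", f))[[1]]

scenario <- m[2]              # the captured scenario label
year     <- as.numeric(m[3])  # the captured four-digit year
```

regexec() returns the full match followed by each capture group, so m[2] and m[3] correspond to temp[2] and temp[3] from the stringr version above.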

Scenario List:

  • scenario_00 - Default settings, short run
  • scenario_01 - Default settings, population allele frequencies, short run
  • scenario_02 - Error rates from Wang 2018, short run
  • scenario_03 - Error rates from Wang 2018, population allele frequencies, short run
  • scenario_04 - In-between error rates (set 1), short run
  • scenario_05 - In-between error rates (set 1), population allele frequencies, short run
  • scenario_06 - In-between error rates (set 2), short run
  • scenario_07 - In-between error rates (set 2), population allele frequencies, short run
  • scenario_08 - Default settings, long run, repeat - Settings from original paper
  • scenario_09 - Default settings, long run, repeat
  • scenario_10 - Default settings, medium run, repeat
  • scenario_11 - Default settings, medium run, repeat
# dodge and width are assumed to be defined in an earlier (unechoed) chunk, e.g.:
dodge <- position_dodge(width = 0.5)
width <- 0.5
ggplot(All_COLONY_data, aes(x = Year, y = Nb_SA_est, colour = Scenario, group = Scenario)) + geom_point(position = dodge) + geom_errorbar(aes(ymin = Nb_SA_lower, ymax = Nb_SA_upper), width = width, position = dodge) 

All_COLONY_data %>% filter(Scenario %in% c("scenario_00", "scenario_01", "scenario_02", "scenario_03","scenario_08", "scenario_final")) %>% ggplot( aes(x = Year, y = Nb_SA_est, colour = Scenario, group=Scenario)) + geom_point(position =dodge) + geom_errorbar(aes(ymin = Nb_SA_lower, ymax = Nb_SA_upper), width = width, position =dodge) 

Nb_SA_data <- All_COLONY_data %>% filter(Scenario == "scenario_final") # need to pick one scenario; e.g. scenario_00 is the defaults with a 'quick' run, scenario_08 uses the original settings
write_csv(Nb_SA_data, "./Results/COLONY/ResultsSummary_final.csv")

Estimates are slightly lower and more consistent for some years if population allele frequencies are used, but otherwise basically the same. The “short” runs show a degree of inconsistency between runs, so the main results will use the “medium” run length. Overall, there is no real reason to deviate from the original settings.

Run NeEstimator

## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2010_MBB`
##                     Statistic Frequency 1
##  Lowest Allele Frequency Used          0+
##     Harmonic Mean Sample Size        28.7
##       Independent Comparisons     6635268
##                   OverAll r^2    0.042079
##           Expected r^2 Sample    0.038692
##                 Estimated Ne^        89.2
##             CI low Parametric          88
##            CI high Parametric        90.5
##              CI low JackKnife          49
##             CI high JackKnife       327.6
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2011_MBB`
##                     Statistic Frequency 1
##  Lowest Allele Frequency Used          0+
##     Harmonic Mean Sample Size        39.5
##       Independent Comparisons     7478193
##                   OverAll r^2    0.028462
##           Expected r^2 Sample    0.027345
##                 Estimated Ne^       296.3
##             CI low Parametric       288.8
##            CI high Parametric       304.2
##              CI low JackKnife       160.3
##             CI high JackKnife      1474.4
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2012_MBB`
##                     Statistic Frequency 1
##  Lowest Allele Frequency Used          0+
##     Harmonic Mean Sample Size        50.4
##       Independent Comparisons     7911821
##                   OverAll r^2    0.022295
##           Expected r^2 Sample    0.021063
##                 Estimated Ne^       268.6
##             CI low Parametric       263.9
##            CI high Parametric       273.5
##              CI low JackKnife       161.9
##             CI high JackKnife       705.1
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2013_MBB`
##                     Statistic Frequency 1
##  Lowest Allele Frequency Used          0+
##     Harmonic Mean Sample Size        61.5
##       Independent Comparisons     8365338
##                   OverAll r^2    0.018535
##           Expected r^2 Sample    0.017074
##                 Estimated Ne^       226.1
##             CI low Parametric       223.4
##            CI high Parametric       228.9
##              CI low JackKnife       153.8
##             CI high JackKnife       404.9
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2014_MBB`
##                     Statistic Frequency 1
##  Lowest Allele Frequency Used          0+
##     Harmonic Mean Sample Size        22.8
##       Independent Comparisons     6333357
##                   OverAll r^2    0.051931
##           Expected r^2 Sample     0.04997
##                 Estimated Ne^       155.3
##             CI low Parametric       150.9
##            CI high Parametric         160
##              CI low JackKnife        72.3
##             CI high JackKnife         Inf
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2010_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.020          0+
##     Harmonic Mean Sample Size        28.7        28.7
##       Independent Comparisons     4886406     6635268
##                   OverAll r^2    0.043025    0.042079
##           Expected r^2 Sample     0.03868    0.038692
##                 Estimated Ne^        69.2        89.2
##             CI low Parametric        68.3          88
##            CI high Parametric          70        90.5
##              CI low JackKnife        37.6          49
##             CI high JackKnife       245.6       327.6
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2011_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.020          0+
##     Harmonic Mean Sample Size        39.5        39.5
##       Independent Comparisons     5912021     7478193
##                   OverAll r^2    0.028652    0.028462
##           Expected r^2 Sample    0.027329    0.027345
##                 Estimated Ne^       249.9       296.3
##             CI low Parametric       243.8       288.8
##            CI high Parametric       256.2       304.2
##              CI low JackKnife       133.5       160.3
##             CI high JackKnife      1317.4      1474.4
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2012_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.020          0+
##     Harmonic Mean Sample Size        50.4        50.4
##       Independent Comparisons     5418516     7911821
##                   OverAll r^2    0.022649    0.022295
##           Expected r^2 Sample    0.021076    0.021063
##                 Estimated Ne^       209.9       268.6
##             CI low Parametric       206.4       263.9
##            CI high Parametric       213.6       273.5
##              CI low JackKnife       126.2       161.9
##             CI high JackKnife       542.6       705.1
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2013_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.020          0+
##     Harmonic Mean Sample Size        61.6        61.5
##       Independent Comparisons     5960243     8365338
##                   OverAll r^2     0.01883    0.018535
##           Expected r^2 Sample    0.017064    0.017074
##                 Estimated Ne^       186.7       226.1
##             CI low Parametric       184.4       223.4
##            CI high Parametric         189       228.9
##              CI low JackKnife       126.8       153.8
##             CI high JackKnife       332.5       404.9
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2014_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.020          0+
##     Harmonic Mean Sample Size        22.8        22.8
##       Independent Comparisons     6333357     6333357
##                   OverAll r^2    0.051931    0.051931
##           Expected r^2 Sample     0.04997     0.04997
##                 Estimated Ne^       155.3       155.3
##             CI low Parametric       150.9       150.9
##            CI high Parametric         160         160
##              CI low JackKnife        72.3        72.3
##             CI high JackKnife         Inf         Inf
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2010_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.050          0+
##     Harmonic Mean Sample Size        28.8        28.7
##       Independent Comparisons     3540661     6635268
##                   OverAll r^2    0.043441    0.042079
##           Expected r^2 Sample     0.03864    0.038692
##                 Estimated Ne^        62.4        89.2
##             CI low Parametric        61.6          88
##            CI high Parametric        63.3        90.5
##              CI low JackKnife          33          49
##             CI high JackKnife       246.8       327.6
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2011_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.050          0+
##     Harmonic Mean Sample Size        39.5        39.5
##       Independent Comparisons     3625872     7478193
##                   OverAll r^2    0.028835    0.028462
##           Expected r^2 Sample    0.027298    0.027345
##                 Estimated Ne^       214.8       296.3
##             CI low Parametric         209       288.8
##            CI high Parametric       220.9       304.2
##              CI low JackKnife       113.9       160.3
##             CI high JackKnife      1153.1      1474.4
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2012_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.050          0+
##     Harmonic Mean Sample Size        50.5        50.4
##       Independent Comparisons     3025517     7911821
##                   OverAll r^2    0.022664    0.022295
##           Expected r^2 Sample    0.021032    0.021063
##                 Estimated Ne^       202.2       268.6
##             CI low Parametric       197.8       263.9
##            CI high Parametric       206.8       273.5
##              CI low JackKnife       123.8       161.9
##             CI high JackKnife       487.9       705.1
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2013_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.050          0+
##     Harmonic Mean Sample Size        61.6        61.5
##       Independent Comparisons     2995652     8365338
##                   OverAll r^2    0.018976    0.018535
##           Expected r^2 Sample     0.01707    0.017074
##                 Estimated Ne^       172.8       226.1
##             CI low Parametric       170.1       223.4
##            CI high Parametric       175.7       228.9
##              CI low JackKnife       117.9       153.8
##             CI high JackKnife       303.8       404.9
## 
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`2014_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.050          0+
##     Harmonic Mean Sample Size        22.8        22.8
##       Independent Comparisons     2831827     6333357
##                   OverAll r^2    0.053037    0.051931
##           Expected r^2 Sample    0.049989     0.04997
##                 Estimated Ne^        99.4       155.3
##             CI low Parametric        96.5       150.9
##            CI high Parametric       102.3         160
##              CI low JackKnife        47.2        72.3
##             CI high JackKnife      4025.9         Inf
Nb_LD_data_all <-  read_csv("./Results/NeEstimator/NeResultsSummary.csv")
Nb_LD_data_all$p_crit <- factor(Nb_LD_data_all$p_crit)

ggplot(Nb_LD_data_all, aes(x = Year, y = Nb_LD_est, colour = p_crit, group = p_crit)) +
  geom_point(position = dodge) +
  geom_errorbar(aes(ymin = Nb_LD_lower, ymax = Nb_LD_upper), width = width, position = dodge) +
  coord_cartesian(ylim = c(0, 1000)) +
  theme_bw()

Nb_LD_data <- Nb_LD_data_all %>% filter(p_crit == p_crit_chosen) # need to pick one; 0.05 matches the original settings

write_csv(Nb_LD_data, "./Results/NeEstimator/NeResultsSummary_final.csv")

The rare-allele bias that the p_crit screen corrects is illustrated here by the different choices of p_crit. Sticking with 0.05 seems like the best option.

Combine and Save All Estimates Code

#read in from file so always works even if not a full rerun
Nb_LD_data <- read_csv( "./Results/NeEstimator/NeResultsSummary_final.csv")
Nb_SA_data <- read_csv( "./Results/COLONY/ResultsSummary_final.csv")

#join and save
data <- left_join(Nb_LD_data, Nb_SA_data, by="Year")
write_csv(data, "./Results/ResultsSummary.csv")

Run Combination Estimates Code

data <- read_csv( "./Results/ResultsSummary.csv")

Ne_to_r <- function(Ne, S){
  r <- (-69*S^2 + sqrt(10000 * (S^4) * (Ne^2) + 4761 * (S^4) - 248400*(S^3)*Ne) + 1800*S*(Ne^2))/( 1800*(S^2)*(Ne^2))  
  return(r)
}

Ne_to_r_star <- function(Ne, S){
  r <- Ne_to_r(Ne, S)
  r <- r - 1/S
  return(r)
}
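As a sanity check on the algebra above, the mapping can be mirrored outside R. The sketch below (in Python, purely illustrative) reproduces `Ne_to_r` and `Ne_to_r_star` and confirms the expected limiting behaviour: \(r^2\) decreases towards the sampling floor \(1/S\) as Ne grows, so \(r^{*} = r^2 - 1/S\) shrinks towards zero.

```python
import math

def ne_to_r(ne: float, s: float) -> float:
    """Python mirror of the R Ne_to_r(): expected r^2 for a given Ne and sample size S."""
    return (-69 * s**2
            + math.sqrt(10000 * s**4 * ne**2 + 4761 * s**4 - 248400 * s**3 * ne)
            + 1800 * s * ne**2) / (1800 * s**2 * ne**2)

def ne_to_r_star(ne: float, s: float) -> float:
    """r* removes the 1/S sampling contribution, leaving the drift signal."""
    return ne_to_r(ne, s) - 1 / s

S = 50.0
r100, r200 = ne_to_r(100, S), ne_to_r(200, S)
assert r200 < r100                   # larger Ne -> weaker LD signal
assert r100 > 1 / S                  # total r^2 always exceeds the sampling floor 1/S
assert ne_to_r_star(1e9, S) < 1e-6   # r* vanishes as Ne -> infinity
```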


data$Var_SA_lower <- 0.25 * data$Nb_SA_est^4 * ((1 / data$Nb_SA_lower) - (1 / data$Nb_SA_est))^2
data$Var_SA_upper <- 0.25 * data$Nb_SA_est^4 * ((1 / data$Nb_SA_upper) - (1 / data$Nb_SA_est))^2
data$Var_SA_mean <- ( data$Var_SA_lower + data$Var_SA_upper ) / 2
data$Var_SA_weight <- 1 / data$Var_SA_mean


data$r_LD_lower <- Ne_to_r(data$Nb_LD_lower, data$S) 
data$r_LD_upper <- Ne_to_r(data$Nb_LD_upper, data$S) 
data$r_LD_est <- Ne_to_r(data$Nb_LD_est, data$S) 

data$rstar_LD_est <- Ne_to_r_star(data$Nb_LD_est, data$S)
data$rstar_LD_lower <- Ne_to_r_star(data$Nb_LD_lower, data$S)
data$rstar_LD_upper <- Ne_to_r_star(data$Nb_LD_upper, data$S)


data$Var_LD_lower <- (1/36) * data$rstar_LD_est^-4 * (data$rstar_LD_est - data$rstar_LD_lower)^2
data$Var_LD_upper <- (1/36) * data$rstar_LD_est^-4 * (data$rstar_LD_est - data$rstar_LD_upper)^2
data$Var_LD_mean <- ( data$Var_LD_lower + data$Var_LD_upper ) / 2
data$Var_LD_weight <- 1 / data$Var_LD_mean

data$weight_sum <- data$Var_LD_weight + data$Var_SA_weight

data$Var_LD_weightadj <- data$Var_LD_weight / data$weight_sum
data$Var_SA_weightadj <- data$Var_SA_weight / data$weight_sum

data$Nb_COM_est <- 1 / ((data$Var_SA_weightadj /data$Nb_SA_est) + (data$Var_LD_weightadj / data$Nb_LD_est))
data$var_COM <- 1 / ((1/data$Var_LD_mean) + (1/data$Var_SA_mean))

# Note: these bounds are the estimate +/- 1 SD, not confidence intervals
data$Nb_COM_lower <- data$Nb_COM_est - sqrt(data$var_COM) 
data$Nb_COM_upper <- data$Nb_COM_est + sqrt(data$var_COM)

data$Nb_COM_SD <- sqrt(data$var_COM)
data$Na_Nb_ratio <- data$Nb_COM_est / Na_for_ratios

data %>% write_csv("./Results/ResultsComplete.csv")

#nice table
data %>% dplyr::select(Year, starts_with("Nb_")) -> data_est_only

if(!include_2014){
  data_est_only <- data_est_only %>% filter(Year != 2014)
}

data_est_only %>% gt() %>% fmt_number(columns = c(5:10), decimals = 1) 
Year Nb_LD_est Nb_LD_lower Nb_LD_upper Nb_SA_est Nb_SA_lower Nb_SA_upper Nb_COM_est Nb_COM_lower Nb_COM_upper Nb_COM_SD
2010 62.4 33.0 246.8 96.0 60.0 186.0 77.0 57.8 96.3 19.26633
2011 214.8 113.9 1153.1 240.0 151.0 589.0 229.8 173.9 285.7 55.93400
2012 202.2 123.8 487.9 196.0 139.0 323.0 197.1 161.3 232.8 35.74270
2013 172.8 117.9 303.8 158.0 112.0 234.0 159.9 132.8 187.0 27.14253
2014 99.4 47.2 4025.9 127.0 69.0 356.0 110.9 78.1 143.7 32.83795
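The combined estimate above is a standard inverse-variance weighting applied on the \(1/N_b\) scale. A minimal sketch of the mechanics (in Python; the input estimates and variances here are hypothetical, chosen purely for illustration, not taken from the data):

```python
def combine_nb(nb_a: float, var_a: float, nb_b: float, var_b: float) -> float:
    """Inverse-variance-weighted harmonic combination of two Nb estimates.

    Weights are 1/variance, normalised to sum to one; the combined estimate
    is the weighted harmonic mean, mirroring the R code's Nb_COM_est.
    """
    w_a, w_b = 1 / var_a, 1 / var_b
    total = w_a + w_b
    w_a, w_b = w_a / total, w_b / total
    return 1 / (w_a / nb_a + w_b / nb_b)

# Hypothetical inputs: one estimate of 60 and one of 100, the second
# carrying four times the variance (and so a quarter of the weight).
combined = combine_nb(60, 100.0, 100, 400.0)
assert 60 < combined < 100   # the combination lies between the inputs
assert combined < 80         # and is pulled towards the lower-variance estimate
```

The combined variance used for the SD reported in the table follows the same logic: `var_COM = 1 / (1/var_a + 1/var_b)`.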

Results

Summary Plots

Plot of Nb estimates with 95% confidence intervals

Plot of Nb estimates with 95% confidence intervals

The combined estimates are also plotted here. The error bars in this case represent ±1 SD, not a confidence interval.

Plot of Combined Nb estimates +- SD

Plot of Combined Nb estimates +- SD

Stability

I take stability to mean that there are no significant differences between the Nb estimates for any years; this corresponds to the null hypothesis that Nb_2010 = Nb_2011 = Nb_2012 = Nb_2013.

LD (NeEstimator)

The situation is difficult to evaluate graphically. No hard conclusions can be drawn from examining the individual CIs alone; however, there are no obvious trends, and the band of values that falls within all confidence intervals is relatively wide.

LD estimates with 95% CIs and values which fall in all CIs marked as a grey rectangle

LD estimates with 95% CIs and values which fall in all CIs marked as a grey rectangle

In order to examine this formally, we will need to construct pairwise tests. These need to be constructed on the \(r^{*}\) scale, as the LDNe confidence intervals for Nb are not normal, whereas the confidence intervals for \(r^{*}\) in LDNe are approximately normal. This method of comparison is statistically consistent with the manner in which the CIs themselves are constructed (Jones, Ovenden and Wang 2016).

A slightly more accurate result could be obtained by pulling the results directly from the LDNe code and making minor adjustments to account for the fact that this is only approximately normal; however, I do not believe this would make a difference in this case.

The following table summarises the pairwise comparisons of the Nb estimates on the \(r^{*}\) scale. The first two columns identify which years are being compared; the third and fourth columns give the relevant \(r^{*}\) values.

Their difference and the estimated standard deviation of that difference follow. The combined standard deviation is found by taking the square root of the sum of the two variances, which assumes the two estimates are independent. While it is likely these estimates are correlated with each other, we have no way of accurately estimating their covariance, so I believe it is best to assume independence. P-values are adjusted with the Holm method on the basis that we are making multiple pairwise comparisons and wish to keep the family-wise error rate below 0.05.

data$rstar_LD_lower <- Ne_to_r_star(data$Nb_LD_lower, data$S) 
data$rstar_LD_upper <- Ne_to_r_star(data$Nb_LD_upper, data$S) 
data$rstar_LD_est <- Ne_to_r_star(data$Nb_LD_est, data$S) 
data$rstar_sd_mean <- (data$rstar_LD_lower - data$rstar_LD_upper) / (2*qnorm(0.9750))

results_frame <- data.frame("Year1" = NA, "Year2"= NA, "r1"= NA, "r2"= NA, "diff"= NA, "combined sd"= NA, "z-score"= NA, "p-value"= NA) %>% na.omit()

for(i in 1:(nrow(data)-1)){
  for(j in (i+1):nrow(data)){
    temp <- c(
      data$Year[i],
      data$Year[j],
      data$rstar_LD_est[i],
      data$rstar_LD_est[j],
      data$rstar_LD_est[i]-data$rstar_LD_est[j],
      sqrt(data$rstar_sd_mean[i]^2 + data$rstar_sd_mean[j]^2), 
      (data$rstar_LD_est[i]-data$rstar_LD_est[j]) / sqrt(data$rstar_sd_mean[i]^2 + data$rstar_sd_mean[j]^2),
      1-pnorm(abs(data$rstar_LD_est[i]-data$rstar_LD_est[j]) / sqrt(data$rstar_sd_mean[i]^2 + data$rstar_sd_mean[j]^2))
      
    )
    
    results_frame[nrow(results_frame) + 1,] <- temp
  
  }
}

results_frame$p.value.adj <- p.adjust(results_frame$p.value, "holm")
results_frame %>% gt() %>% opt_stylize() %>% fmt_number(columns = c(7:9), decimals = 3) %>% fmt_number(columns = c(3:6), n_sigfig =  3)
Year1 Year2 r1 r2 diff combined.sd z.score p.value p.value.adj
2010 2011 0.000874 0.000257 0.000617 0.000375 1.647 0.050 0.498
2010 2012 0.000874 0.000273 0.000601 0.000368 1.634 0.051 0.498
2010 2013 0.000874 0.000320 0.000555 0.000365 1.519 0.064 0.515
2010 2014 0.000874 0.000552 0.000322 0.000460 0.701 0.242 1.000
2011 2012 0.000257 0.000273 −0.0000161 0.000140 −0.115 0.454 1.000
2011 2013 0.000257 0.000320 −0.0000624 0.000133 −0.470 0.319 1.000
2011 2014 0.000257 0.000552 −0.000295 0.000310 −0.952 0.171 1.000
2012 2013 0.000273 0.000320 −0.0000464 0.000112 −0.415 0.339 1.000
2012 2014 0.000273 0.000552 −0.000279 0.000301 −0.925 0.177 1.000
2013 2014 0.000320 0.000552 −0.000232 0.000298 −0.779 0.218 1.000
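The Holm adjustment applied above (via R's `p.adjust`) can be written in a few lines. This sketch (illustrative, not the `p.adjust` source) sorts the raw p-values, multiplies each by the number of hypotheses not yet rejected, and enforces monotonicity:

```python
def holm_adjust(pvals: list[float]) -> list[float]:
    """Holm step-down adjustment, matching R's p.adjust(method = "holm")."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # multiply by the number of remaining hypotheses, cap at 1,
        # and keep the adjusted sequence non-decreasing
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

# Small worked example: raw p-values 0.01, 0.04, 0.03
adj = holm_adjust([0.01, 0.04, 0.03])
assert all(abs(a - b) < 1e-12 for a, b in zip(adj, [0.03, 0.06, 0.06]))
```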

SA (COLONY)

With the updated results, the intervals for the SA method are no longer mutually disjoint, so the analysis is now much the same as for LD.

The analysis proceeds in a similar manner to that for \(r^{*}\), but on the \(1/(2N_b)\) scale.

data$Nb_SA_est_inv <- 1 / (2*data$Nb_SA_est)

data$vstar_lower <- ((1/data$Nb_SA_lower - 1/data$Nb_SA_est) / (2*qnorm(0.9750)))^2
data$vstar_upper <- ((1/data$Nb_SA_est - 1/data$Nb_SA_upper) / (2*qnorm(0.9750)))^2

data$vstar_mean <- (data$vstar_upper+data$vstar_lower)/2

results_frame <- data.frame("Year1" = NA, "Year2"= NA, "2Nb^-1 1"= NA, "2Nb^-1 2"= NA, "diff"= NA, "combined sd"= NA, "z-score"= NA, "p-value"= NA) %>% na.omit()

for(i in 1:(nrow(data)-1)){
  for(j in (i+1):nrow(data)){
    temp <- c(
      data$Year[i],
      data$Year[j],
      data$Nb_SA_est_inv[i],
      data$Nb_SA_est_inv[j],
      data$Nb_SA_est_inv[i]-data$Nb_SA_est_inv[j],
      sqrt(data$vstar_mean[i] + data$vstar_mean[j]), 
      (data$Nb_SA_est_inv[i]-data$Nb_SA_est_inv[j]) / sqrt(data$vstar_mean[i] + data$vstar_mean[j]),
      1-pnorm(abs(data$Nb_SA_est_inv[i]-data$Nb_SA_est_inv[j]) / sqrt(data$vstar_mean[i] + data$vstar_mean[j]))
      
    )
    
    results_frame[nrow(results_frame) + 1,] <- temp
  
  }
}

results_frame$p.value.adj <- p.adjust(results_frame$p.value, "holm")

names(results_frame)[3:4] <-c('(2Ne)^-1 1', '(2Ne)^-1 2') 
results_frame %>% gt() %>% opt_stylize() %>% fmt_number(columns = c(6:9), decimals = 3) %>% fmt_number(columns = c(3:5), n_sigfig =  3)
Year1 Year2 (2Ne)^-1 1 (2Ne)^-1 2 diff combined.sd z.score p.value p.value.adj
2010 2011 0.00521 0.00208 0.00312 0.002 1.979 0.024 0.239
2010 2012 0.00521 0.00255 0.00266 0.002 1.726 0.042 0.380
2010 2013 0.00521 0.00316 0.00204 0.002 1.304 0.096 0.768
2010 2014 0.00521 0.00394 0.00127 0.002 0.609 0.271 0.960
2011 2012 0.00208 0.00255 −0.000468 0.001 −0.572 0.284 0.960
2011 2013 0.00208 0.00316 −0.00108 0.001 −1.247 0.106 0.768
2011 2014 0.00208 0.00394 −0.00185 0.002 −1.138 0.128 0.768
2012 2013 0.00255 0.00316 −0.000614 0.001 −0.773 0.220 0.960
2012 2014 0.00255 0.00394 −0.00139 0.002 −0.871 0.192 0.960
2013 2014 0.00316 0.00394 −0.000772 0.002 −0.477 0.317 0.960
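The variance recovery above relies on the CI bounds being an approximately fixed number of SDs from the point estimate on the \(1/N_b\) scale. This sketch (in Python, using the 2010 SA numbers from the table of estimates, Nb = 96 with 95% CI 60–186, purely to illustrate the mechanics) mirrors the `vstar_lower`/`vstar_upper` lines of the R code above:

```python
Z975 = 1.959964  # the 0.975 normal quantile, qnorm(0.975) in R

def inv_scale_var(nb_est: float, nb_bound: float) -> float:
    """One-sided variance of 1/Nb recovered from a CI bound,
    mirroring the vstar_lower / vstar_upper lines in the R code above."""
    return ((1 / nb_bound - 1 / nb_est) / (2 * Z975)) ** 2

# 2010 SA estimate: Nb = 96, 95% CI (60, 186)
v_lo = inv_scale_var(96, 60)
v_hi = inv_scale_var(96, 186)
v_mean = (v_lo + v_hi) / 2   # the averaged variance used in the z-tests
assert v_lo > 0 and v_hi > 0
assert v_lo > v_hi           # the lower Nb bound is further away on the 1/Nb scale
```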

I have a number of reservations about this analysis for the COLONY estimates, but I have included this analysis as I believe it is important.

These issues are:

  • I am much less familiar with the CIs for COLONY than for LDNe
  • The variance extracted from the CI bounds was not recovered very accurately; at the time of the original paper I put this down to COLONY rounding its bounds to whole numbers, but I am no longer sure

Discussion

Plot of Nb estimates with 95% confidence intervals

Plot of Nb estimates with 95% confidence intervals

Requested Outputs

The following outputs have been specifically requested and this section provides extra details on them.

A table comparing the before- and after-correction values for LD Nb, SA Nb, and any derived Nb estimates.

A new file ‘/Original Project Files/Results/Original_Results.xlsx’ has been created with exact values from the published manuscript to avoid all doubt about the original numbers.

comparison_variables <- c("Year",   "S",    "Nb_LD_est",    "Nb_LD_lower",  "Nb_LD_upper",  "Nb_SA_est",    "Nb_SA_lower",  "Nb_SA_upper", "Nb_COM_est",    "Nb_COM_SD", "Na_Nb_ratio", "FS_count_Nb_SA",   "HS_count_Nb_SA")

original_data <- read_xlsx('./Original Project Files/Results/Original_Results.xlsx') %>% select(all_of(comparison_variables))
new_data <- read_csv("./Results/ResultsComplete.csv") %>% select(all_of(comparison_variables))

original_data$Era <- "Original"
new_data$Era <- "Reanalysis"

comparison_table <- rbind(original_data, new_data)
comparison_table %>% write_csv("./Results/Comparison_data.csv")
comparison_table %>% gt()
Year S Nb_LD_est Nb_LD_lower Nb_LD_upper Nb_SA_est Nb_SA_lower Nb_SA_upper Nb_COM_est Nb_COM_SD Na_Nb_ratio FS_count_Nb_SA HS_count_Nb_SA Era
2010 29 193.2 91.1 NA 271 136 1430 NA NA NA 2 2 Original
2011 42 195.1 104.2 952.9 344 204 923 233.20000 69.50000 0.3100000 4 4 Original
2012 52 165.6 104.2 359.6 341 157 399 206.10000 45.90000 0.2700000 8 6 Original
2013 63 208.5 116.4 712.7 289 200 461 252.00000 46.70000 0.3400000 8 10 Original
2014 23 521.9 282.2 3112.4 NA NA NA NA NA NA NA NA Original
2010 29 62.4 33.0 246.8 96 60 186 77.01947 19.26633 0.1026926 7 3 Reanalysis
2011 40 214.8 113.9 1153.1 240 151 589 229.81139 55.93400 0.3064152 4 5 Reanalysis
2012 51 202.2 123.8 487.9 196 139 323 197.06175 35.74270 0.2627490 7 13 Reanalysis
2013 62 172.8 117.9 303.8 158 112 234 159.90230 27.14253 0.2132031 16 25 Reanalysis
2014 23 99.4 47.2 4025.9 127 69 356 110.90644 32.83795 0.1478753 3 2 Reanalysis

Text, plots and/ or statistics that support the “stable Nb over X years” conclusion in the paper.

See earlier parts of this report. More can be added if required.

The final LD Ne datafile used to make the corrected estimates of Nb.

This can be found in /Data/Processed/NeEstimator/genotypes_for_Ne.gen

List of loci used also saved to /Data/Processed/loci_names.csv

gl_file@loc.names %>% as.data.frame() %>% write_csv("./Data/Processed/loci_names.csv")

The overall LD Ne estimate.

This estimate should be treated with extreme caution when being interpreted.

gl_file_temp <- gl.merge.pop(gl_file, old =  levels(gl_file@pop), new = "one-big-pop")
## Starting gl.merge.pop 
##   Processing genlight object with SNP data
##   Merging a list of populations into one
##   Merging 2005 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 into one-big-pop 
## Completed: gl.merge.pop
gl.LDNe(
        gl_file_temp,
        neest.path = "./Executables/NeEstimator",
        critical = p_crit_chosen ,
        singleton.rm = FALSE,
        mating = "random",
        plot.out = FALSE,
        verbose = 0)
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  /tmp/RtmpQVFQIe/dummy.gen/
## Completed: gl2genepop 
## $`one-big-pop_MBB`
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used       0.050          0+
##     Harmonic Mean Sample Size         234       233.6
##       Independent Comparisons     3177911     9088639
##                   OverAll r^2    0.005865    0.005685
##           Expected r^2 Sample    0.004329    0.004336
##                 Estimated Ne^       214.9       244.9
##             CI low Parametric       213.6       243.9
##            CI high Parametric       216.2       245.8
##              CI low JackKnife       183.4       211.2
##             CI high JackKnife         256       288.3

The data and scripts required for the groups, reviewers, and readers to easily recreate these results

This markdown file, the associated R scripts (colony_script.R, colony_script2.R, filter_script.R ), the files in /Data/Raw/ and the executables for COLONY and NeEstimator are all that’s required to reproduce this analysis in full.

The data filtering and analyses required to produce the LD Ne genepop datafile used to make the final estimates of Nb

For clarity this code is reproduced here in a single block.

#first we need the cohorts

# Estimate age directly using values from O'Connor 2011 and FL
#solve for age
vbgf3 <- function(L, L_infinity, K_value, t_0) {
  result <- t_0 - (1 / K_value) * log(1 - (L / L_infinity))
  return(result)
}

# Estimating age at length
# Best-fitting parameters for males
Linf <- 798.94 # cm TL
k <- 0.047
L0 <- 140 # cm
T0 <- -3.8 # years
Males <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)

# Best-fitting parameters for females
Linf <- 719.02 # cm TL
k <- 0.056
L0 <- 140 # cm
T0 <- -3.8 # years
Females <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)

# Best-fitting parameters for the sexes combined
Linf <- 746.66 # cm TL
k <- 0.053
L0 <- 140 # cm
T0 <- -3.8 # years
Combined <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)

# index individuals by sex
idx_f <- which(meta_data$Sex == "F")
idx_m <- which(meta_data$Sex == "M")

male_t = sapply(TL[idx_m], function(x) {vbgf3(x, Males[["L_infinity_value"]], Males[["K_value"]], Males[["t_0_value"]])})
female_t = sapply(TL[idx_f], function(x) {vbgf3(x, Females[["L_infinity_value"]], Females[["K_value"]], Females[["t_0_value"]])})

# Cohort (birth year) is equal to year sampled minus age
male_cohort = trunc(meta_data$modified_year[idx_m] - male_t)
female_cohort = trunc(meta_data$modified_year[idx_f] - female_t) # for some samples, e.g. MBB1348, the cohort depends on how the age or year is rounded
output = tibble::tibble(STRATA = c(male_cohort, female_cohort), MBB_CODE = c(meta_data$MBB_Code[idx_m], meta_data$MBB_Code[idx_f]))
readr::write_tsv(output, "Data/Processed/New_Cohort_Assignment.tsv")

cohort_assignment_file <- "Data/Processed/New_Cohort_Assignment.tsv"
cohort_assignment = readr::read_tsv(cohort_assignment_file) %>% #New cohort assignment
  separate(MBB_CODE, into = c("string1", "string2"), sep = " ", remove = FALSE) %>%
  unite(MBB_CODE_JOIN, string1, string2, sep = "_") 
  

#filtration happens here
source("./Scripts/filter_script.R")

#load filtered genotypes
gl_file <- readRDS('./Data/Processed/filtered_genotypes.RData')

gl_file$ind.names <- str_replace(gl_file$ind.names, "-", "_")

ind_order = tibble::tibble(MBB_CODE_JOIN = gl_file$ind.names, OLD_STRATA = gl_file@pop) %>%
  left_join(., cohort_assignment, by = "MBB_CODE_JOIN")

gl_file@pop <- as.factor(ind_order$STRATA)
dartR::gl2genepop(gl_file, outfile = "genotypes_for_Ne.gen", outpath = "Data/Processed/NeEstimator")
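The cohort assignment above inverts the von Bertalanffy growth function to get age at length. The sketch below (in Python, using the male parameters quoted above) checks that the inverse curve is consistent with the forward curve \(L(t) = L_\infty(1 - e^{-K(t - t_0)})\):

```python
import math

# Male VBGF parameters as quoted above (O'Connor 2011)
L_INF, K, T0 = 798.94, 0.047, -3.8

def vbgf_length(t: float) -> float:
    """Forward von Bertalanffy curve: length at age t."""
    return L_INF * (1 - math.exp(-K * (t - T0)))

def vbgf_age(length: float) -> float:
    """Inverse curve (the vbgf3() above): age at length."""
    return T0 - (1 / K) * math.log(1 - length / L_INF)

# Round trip: an 8-year-old's predicted length maps back to age 8
assert abs(vbgf_age(vbgf_length(8.0)) - 8.0) < 1e-9

# A 450 cm TL male comes out in the low teens of years under these parameters
assert 13 < vbgf_age(450.0) < 15
```

Cohort then follows as the (truncated) sampling year minus this estimated age, as in the R code above.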

R Settings

R.version
##                _                           
## platform       x86_64-pc-linux-gnu         
## arch           x86_64                      
## os             linux-gnu                   
## system         x86_64, linux-gnu           
## status                                     
## major          4                           
## minor          3.3                         
## year           2024                        
## month          02                          
## day            29                          
## svn rev        86002                       
## language       R                           
## version.string R version 4.3.3 (2024-02-29)
## nickname       Angel Food Cake
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Brisbane
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] fstcore_0.9.18   readxl_1.4.3     here_1.0.1       radiator_1.3.0  
##  [5] nlme_3.1-164     gt_0.10.1        hierfstat_0.5-11 lubridate_1.9.3 
##  [9] forcats_1.0.0    stringr_1.5.1    purrr_1.0.2      readr_2.1.5     
## [13] tidyr_1.3.1      tibble_3.2.1     tidyverse_2.0.0  dartR_2.9.7     
## [17] dartR.data_1.0.2 dplyr_1.1.4      ggplot2_3.5.0    adegenet_2.1.10 
## [21] ade4_1.7-22     
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.3       later_1.3.2         fields_15.2        
##   [4] HardyWeinberg_1.7.5 R.oo_1.26.0         cellranger_1.1.0   
##   [7] mmod_1.3.3          rpart_4.1.23        lifecycle_1.0.4    
##  [10] Rdpack_2.6          doParallel_1.0.17   rprojroot_2.0.4    
##  [13] globals_0.16.3      Rsolnp_1.16         lattice_0.22-5     
##  [16] vroom_1.6.5         MASS_7.3-60         backports_1.4.1    
##  [19] magrittr_2.0.3      sass_0.4.9          rmarkdown_2.26     
##  [22] jquerylib_0.1.4     yaml_2.3.8          httpuv_1.6.14      
##  [25] gap_1.5-3           spam_2.10-0         sp_2.1-3           
##  [28] minqa_1.2.6         RColorBrewer_1.1-3  maps_3.4.2         
##  [31] R.utils_2.12.3      gdsfmt_1.38.0       PopGenReport_3.1   
##  [34] nnet_7.3-19         listenv_0.9.1       gdata_3.0.0        
##  [37] terra_1.7-71        vegan_2.6-4         parallelly_1.37.1  
##  [40] StAMPP_1.6.3        permute_0.9-7       codetools_0.2-19   
##  [43] dismo_1.3-14        xml2_1.3.6          tidyselect_1.2.1   
##  [46] shape_1.4.6.1       raster_3.6-26       farver_2.1.1       
##  [49] lme4_1.1-35.1       broom.mixed_0.2.9.4 jsonlite_1.8.8     
##  [52] fst_0.9.8           mitml_0.4-5         ellipsis_0.3.2     
##  [55] survival_3.5-8      iterators_1.0.14    foreach_1.5.2      
##  [58] tools_4.3.3         SNPRelate_1.36.1    Rcpp_1.0.12        
##  [61] glue_1.7.0          gridExtra_2.3       genetics_1.3.8.1.3 
##  [64] pan_1.9             xfun_0.42           mgcv_1.9-1         
##  [67] withr_3.0.0         combinat_0.0-8      fastmap_1.1.1      
##  [70] GGally_2.2.1        boot_1.3-30         fansi_1.0.6        
##  [73] digest_0.6.35       truncnorm_1.0-9     timechange_0.3.0   
##  [76] R6_2.5.1            mime_0.12           mice_3.16.0        
##  [79] colorspace_2.1-0    gtools_3.9.5        R.methodsS3_1.8.2  
##  [82] utf8_1.2.4          generics_0.1.3      calibrate_1.7.7    
##  [85] data.table_1.15.2   ggstats_0.5.1       pkgconfig_2.0.3    
##  [88] gtable_0.3.4        furrr_0.3.1         htmltools_0.5.7    
##  [91] dotCall64_1.1-1     scales_1.3.0        png_0.1-8          
##  [94] knitr_1.45          rstudioapi_0.16.0   tzdb_0.4.0         
##  [97] reshape2_1.4.4      nloptr_2.0.3        cachem_1.0.8       
## [100] parallel_4.3.3      pillar_1.9.0        grid_4.3.3         
## [103] vctrs_0.6.5         promises_1.2.1      jomo_2.7-6         
## [106] xtable_1.8-4        cluster_2.1.6       evaluate_0.23      
## [109] mvtnorm_1.2-4       cli_3.6.2           compiler_4.3.3     
## [112] rlang_1.1.3         crayon_1.5.2        labeling_0.4.3     
## [115] carrier_0.1.1       plyr_1.8.9          gap.datasets_0.0.6 
## [118] stringi_1.8.3       viridisLite_0.4.2   munsell_0.5.0      
## [121] glmnet_4.1-8        Matrix_1.6-5        hms_1.1.3          
## [124] patchwork_1.2.0     bit64_4.0.5         future_1.33.1      
## [127] gdistance_1.6.4     pegas_1.3           seqinr_4.2-36      
## [130] RgoogleMaps_1.5.1   shiny_1.8.0         highr_0.10         
## [133] rbibutils_2.2.16    igraph_2.0.3        broom_1.0.5        
## [136] bslib_0.6.2         bit_4.0.5           ape_5.7-1

References

Davenport, Danielle, Paul Butcher, Sara Andreotti, Conrad Matthee, Andrew Jones, and Jennifer Ovenden. 2021. “Effective Number of White Shark (Carcharodon Carcharias, Linnaeus) Breeders Is Stable over Four Successive Years in the Population Adjacent to Eastern Australia and New Zealand.” Ecology and Evolution 11 (1): 186–98.
Leberg, Paul. 2005. “Genetic Approaches for Estimating the Effective Size of Populations.” The Journal of Wildlife Management 69 (4): 1385–99.
Luikart, Gordon, Tiago Antao, Brian K. Hand, Clint C. Muhlfeld, Matthew C. Boyer, Ted Cosart, Brian Trethewey, Robert Al-Chockhachy, and Robin S. Waples. 2021. “Detecting Population Declines via Monitoring the Effective Number of Breeders (Nb).” Molecular Ecology Resources 21 (2): 379–93. https://doi.org/10.1111/1755-0998.13251.
O’Connor, J. 2011. “Age, Growth and Movement Signatures of the White Shark (Carcharodon Carcharias) in Southern Australia.” The University of Technology Sydney.
Wang, J. 2005. “Estimation of Effective Population Sizes from Data on Genetic Markers.” Philosophical Transactions of the Royal Society B: Biological Sciences 360 (1459): 1395–1409. https://doi.org/10.1098/rstb.2005.1682.
Wang, Jinliang. 2004. “Sibship Reconstruction from Genetic Data with Typing Errors.” Genetics 166 (4): 1963–79.
———. 2018. “Estimating Genotyping Errors from Genotype and Reconstructed Pedigree Data.” Methods in Ecology and Evolution 9 (1): 109–20.
Waples, Robin S. 2022. “What Is N e, Anyway?” Edited by William Murphy. Journal of Heredity 113 (4): 371–79. https://doi.org/10.1093/jhered/esac023.